This blog post will outline my completion of PIC 16B Homework 6. I will describe the creation of a machine learning model that can distinguish between fake and real news.¶
- We must first import all of the packages that will be used to build and evaluate the model.
In [1]:
# packages to form the datasets:
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
import re
import string
# packages to build and test the models
import keras
from keras import layers, losses
from keras import utils
from keras.layers import TextVectorization
# packages to visualize results:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import plotly.express as px
print("keras version ", keras.__version__)
keras version 3.0.5
- The printout above confirms that we are using keras 3 for this homework.
- Now that we have all the necessary packages imported, we will start looking at the data.
1. Acquire Training Data¶
- We load in the article data from an outside source, as shown below.
In [2]:
# read in the training data as a pandas dataframe
train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)
df.head()
Out[2]:
| | Unnamed: 0 | title | text | fake |
|---|---|---|---|---|
| 0 | 17366 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | 5634 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | 17487 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | 12217 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | 5535 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |
- As we can see, the data has columns for the title of the article, the text of the article, and whether or not the article is fake. We will attempt to predict the fake column using the title and text columns.
2. Make a Dataset¶
- As of right now, we have a pandas dataframe containing all of our data. We must convert that into a tensorflow dataset in order to use the data in our model.
make_dataset() function¶
- We will create a function that takes in a pandas dataframe, converts all of its strings to lowercase, removes stop words, and outputs a tensorflow dataset that is batched to optimize training time.
- This function is outlined below.
In [3]:
def make_dataset(df):
    '''
    Converts all strings to lowercase, removes stopwords, and returns a
    tensorflow dataset.
    Args:
        df: a pandas dataframe of article data
    Returns:
        a tensorflow dataset with inputs title and text and output fake
    '''
    # convert the title and text columns to lowercase
    df['title'] = df['title'].str.lower()
    df['text'] = df['text'].str.lower()
    # remove all stopwords from the title and text columns
    stop = stopwords.words('english')
    stop_fun = lambda x: ' '.join([word for word in x.split() if word not in stop])
    df['title'] = df['title'].apply(stop_fun)
    df['text'] = df['text'].apply(stop_fun)
    # create a tensorflow dataset with inputs title and text and output fake
    output = tf.data.Dataset.from_tensor_slices(
        (
            {
                'title': df[['title']],
                'text': df[['text']]
            },
            {
                'fake': df[['fake']]
            }
        )
    )
    # batch the dataset by 100 and return
    return output.batch(100)
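To illustrate the stopword filter at the heart of `make_dataset()`, here is a standalone sketch using a hand-picked mini stopword list (nltk's actual `stopwords.words('english')` list is much longer):

```python
# hypothetical mini stopword list for illustration only
stop = {"the", "a", "is", "on", "of"}
# same filtering lambda as in make_dataset()
stop_fun = lambda x: ' '.join([word for word in x.split() if word not in stop])
print(stop_fun("the senate votes on a new bill"))  # senate votes new bill
```

Filler words disappear while the informative words survive, which is exactly what we want before vectorizing.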
- Now that we have a function to create a dataset, we will run it on our pandas dataframe from before.
In [4]:
# create our main tensorflow dataset
df = make_dataset(df)
- We now have a tensorflow dataset of our data, so we need to split it into training and validation to proceed with model creation.
Validation Data¶
- We will split the data into 20% validation and 80% training data to prepare for the creation of our models.
In [5]:
# specify validation size to be 20% of dataset
val_size = int(0.2*len(df))
# get validation and training datasets
val = df.take(val_size)
train = df.skip(val_size)
- Now we have two datasets that we can train our model with, so we should establish a baseline performance expectation for our model.
Base Rate¶
- We will examine the proportion of fake and real news in our training data.
In [6]:
out = np.empty(0)
# for each batch in the dataset
for article, fake in train:
    # add the "fake" column entries to the output array
    out = np.append(out, fake['fake'].numpy().flatten())
print("Proportion of fake news: ", np.mean(out))
print("Proportion of real news: ", 1 - np.mean(out))
Proportion of fake news:  0.5243746169703047
Proportion of real news:  0.47562538302969526
- The proportion of fake news in our training data is 0.5244, so our model must have accuracy higher than 0.5244 to outperform the base rate.
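The base rate can be sanity-checked with a toy example: a classifier that always predicts "fake" is right exactly as often as the fake proportion. (The counts below are made up to match the 0.5244 proportion, not taken from the dataset.)

```python
# hypothetical labels matching the observed fake proportion (1 = fake)
labels = [1] * 5244 + [0] * 4756
# the trivial baseline classifier: always guess fake
always_fake = [1] * len(labels)
accuracy = sum(p == y for p, y in zip(always_fake, labels)) / len(labels)
print(accuracy)  # 0.5244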
Text Vectorization¶
- We now need to convert the string inputs into sequences of integers so that the data can be fed into the model.
- We will do this by ordering the words by most to least common, keeping only the top 2000 most common words.
- This process is outlined below.
In [7]:
#preparing a text vectorization layer for tf model
size_vocabulary = 2000
# convert the strings to lowercase and remove all punctuation
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                              '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation
# create a layer that converts strings to integers
vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500)
# apply the layer to the title and text columns in the training data
vectorize_layer.adapt(train.map(lambda x, y: x["title"]))
vectorize_layer.adapt(train.map(lambda x, y: x["text"]))
- Now that we have successfully vectorized the strings in our training data, we can proceed with model testing.
3. Create Models¶
- We will now proceed to model creation.
- We already have a text vectorization layer that will be used in our models, as shown above. We must now create an embedding layer, which is done below.
In [8]:
# create an embedding layer over our vocabulary of 2000 words (same as specified above)
embedding_layer = layers.Embedding(size_vocabulary, 3)
- We will create one model using only the title of the articles, one model using only the text of the articles, and one model using both.
- This function will handle the creation of these models.
In [9]:
def get_model(preds):
    '''
    Returns a neural network model using keras 3 to predict whether an article
    is fake news.
    Args:
        preds: "title", "text", or "both"
    Returns:
        a model using the specified predictor(s)
    '''
    # if we want to use the title or both title and text
    if (preds == "title") | (preds == "both"):
        # create the title input to the model
        title_input = keras.Input(
            shape = (1,),
            name = "title",
            dtype = "string"
        )
        # create the layers for the title predictor
        title_features = vectorize_layer(title_input)
        title_features = embedding_layer(title_features)
        title_features = layers.Dropout(0.2)(title_features)
        title_features = layers.GlobalAveragePooling1D()(title_features)
        title_features = layers.Dropout(0.2)(title_features)
        title_features = layers.Dense(32, activation='relu')(title_features)
    if (preds == "text") | (preds == "both"):
        # create the text input to the model
        text_input = keras.Input(
            shape = (1,),
            name = "text",
            dtype = "string"
        )
        # create the layers for the text predictor
        text_features = vectorize_layer(text_input)
        text_features = embedding_layer(text_features)
        text_features = layers.Dropout(0.2)(text_features)
        text_features = layers.GlobalAveragePooling1D()(text_features)
        text_features = layers.Dropout(0.2)(text_features)
        text_features = layers.Dense(32, activation='relu')(text_features)
    if preds == "both":
        # put both title and text layers into the main part of the model
        main = layers.concatenate([title_features, text_features], axis = 1)
        main = layers.Dense(32, activation='relu')(main)
        # create the model output using main
        output = layers.Dense(2, name = "fake")(main)
        # create the model using both title and text predictors
        model = keras.Model(inputs = [title_input, text_input], outputs = output)
    elif preds == "title":
        # create the main part of the model using the title features
        main = layers.Dense(32, activation='relu')(title_features)
        output = layers.Dense(2, name = "fake")(main)
        # create the model using only the title input
        model = keras.Model(inputs = [title_input], outputs = output)
    else:
        # create the main part of the model using the text features
        main = layers.Dense(32, activation='relu')(text_features)
        output = layers.Dense(2, name = "fake")(main)
        # create the model using only the text input
        model = keras.Model(inputs = [text_input], outputs = output)
    # compile the model
    model.compile(optimizer = "adam",
                  loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics = ['accuracy']
    )
    return model
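Note that the `fake` output layer has no activation: it emits two raw scores (logits), and `from_logits=True` tells `SparseCategoricalCrossentropy` to apply the softmax itself. A minimal pure-Python sketch of that conversion, with made-up logits:

```python
import math

def softmax(logits):
    # exponentiate each logit and normalize so the outputs sum to 1
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for one article: [score for class 0 (real), class 1 (fake)]
probs = softmax([2.0, -1.0])
print([round(p, 4) for p in probs])  # [0.9526, 0.0474]
```

Keeping the outputs as logits and letting the loss handle the softmax is the numerically stabler option.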
- Now that we can create a model, we need to fit it using our training and validation data.
- The below function allows us to fit our model using our train and val datasets for a given number of epochs.
In [10]:
def fit_model(model, train, val, epochs):
    '''
    Fits a neural network model on a training and validation dataset for a given
    number of epochs.
    Args:
        model: a keras model created by get_model()
        train: a training dataset created by make_dataset()
        val: a validation dataset created by make_dataset()
        epochs: the number of epochs to fit the model for
    Returns:
        the history output of fitting the model
    '''
    # create a stopping condition for the fitting of the model
    callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    return model.fit(train,
                     validation_data = val,
                     epochs = epochs,
                     callbacks = [callback],
                     verbose = True)
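The `patience=5` setting means fitting stops once the validation loss has failed to improve for 5 consecutive epochs. A rough pure-Python sketch of that logic (illustrative only, not keras's actual implementation):

```python
def stop_epoch(val_losses, patience=5):
    # return the epoch at which early stopping would halt training
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # validation loss improved: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# hypothetical loss curve: the best value occurs at epoch 2, then never improves
print(stop_epoch([0.50, 0.40, 0.45, 0.46, 0.44, 0.47, 0.48, 0.43], patience=5))  # 7
```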
- We can now create and fit our model, so we want a way to assess its accuracy.
- The function below displays a plot that shows the model's training and validation accuracy for each epoch that the model is trained on.
In [11]:
def plot_model_accuracy(history, model_type):
    '''
    Plots the training and validation accuracy of a model for each epoch.
    Args:
        history: the output from fitting the model
        model_type: the name of the model
    Returns:
        a line plot showing the training and validation accuracy of the model
    '''
    # plot the training accuracy
    plt.plot(history.history["accuracy"], label='training')
    # plot the validation accuracy
    plt.plot(history.history["val_accuracy"], label='validation')
    # give the plot a descriptive title
    plt.title(f"Training and Validation Accuracy for {model_type} Model")
    plt.legend()
- We now have the framework to create, fit, and display all of our desired models. We will now proceed to actually creating these models using our training and validation data.
Model Using Only Title¶
- Our first model will only use the title of the article to predict whether it is fake or real.
- The model is created below and its summary is outputted.
In [12]:
# create model using only title as a predictor and output its summary
model_title = get_model("title")
model_title.summary()
Model: "functional_1"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| title (InputLayer) | (None, 1) | 0 |
| text_vectorization (TextVectorization) | (None, 500) | 0 |
| embedding (Embedding) | (None, 500, 3) | 6,000 |
| dropout (Dropout) | (None, 500, 3) | 0 |
| global_average_pooling1d (GlobalAveragePooling1D) | (None, 3) | 0 |
| dropout_1 (Dropout) | (None, 3) | 0 |
| dense (Dense) | (None, 32) | 128 |
| dense_1 (Dense) | (None, 32) | 1,056 |
| fake (Dense) | (None, 2) | 66 |
Total params: 7,250 (28.32 KB)
Trainable params: 7,250 (28.32 KB)
Non-trainable params: 0 (0.00 B)
- We will now display the model in an easy-to-read manner, done below.
In [13]:
# plot the model layers in a flow chart format
utils.plot_model(model_title, "model_title.png",
show_shapes=True,
show_layer_names=True)
Out[13]:
- As we can see, the model has the following layers: input, text vectorization, embedding, dropout, global average pooling 1D, dropout, and three dense layers.
- Now that we have created the model, we will fit the model using the training and validation sets above and 50 epochs.
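As an aside, the global average pooling layer in the summary above is what collapses the (500, 3) embedded sequence into a single 3-vector: it averages the embedding values over the token positions. A pure-Python sketch on a toy sequence:

```python
def global_average_pool(seq):
    # seq: list of token embedding vectors; average each dimension over all tokens
    dims = len(seq[0])
    return [sum(vec[d] for vec in seq) / len(seq) for d in range(dims)]

# hypothetical 4-token sequence of 3-dimensional embeddings
print(global_average_pool([[1, 2, 3], [3, 2, 1], [0, 0, 0], [4, 4, 4]]))  # [2.0, 2.0, 2.0]
```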
In [14]:
# fit the model on our training data, testing it on the validation data
history_title = fit_model(model_title, train, val, 50)
Epoch 1/50  180/180 - accuracy: 0.5240 - loss: 0.6922 - val_accuracy: 0.5173 - val_loss: 0.6892
Epoch 2/50  180/180 - accuracy: 0.5633 - loss: 0.6810 - val_accuracy: 0.5731 - val_loss: 0.6190
[...]
Epoch 22/50 180/180 - accuracy: 0.8608 - loss: 0.3216 - val_accuracy: 0.8547 - val_loss: 0.3114
Epoch 23/50 180/180 - accuracy: 0.8611 - loss: 0.3118 - val_accuracy: 0.8589 - val_loss: 0.3041
- The early stopping condition halted the fitting of the model after 23 epochs.
- As we can see, the model stabilized between 0.82 and 0.86 validation accuracy. This is pretty good, but we can certainly do better using more predictors.
- We will now visualize the accuracy of the model on the training and validation data.
In [15]:
# plot training and validation accuracy calculated during model fitting
plot_model_accuracy(history_title, "Title")
- As we can see, the validation accuracy is rather unpredictable throughout the model fitting process, but it stabilizes at about 0.85.
- Throughout the model training process, the validation accuracy is about the same as the training accuracy. Hence, we do not have an issue with overfitting in this model.
- We will now attempt to create a better model.
Model Using Only Text¶
- Our second model will only use the text of the article to predict whether it is fake or real.
- The model is created below and its summary is outputted.
In [16]:
# create model using only text as a predictor and output its summary
model_text = get_model("text")
model_text.summary()
Model: "functional_3"
| Layer (type) | Output Shape | Param # |
|---|---|---|
| text (InputLayer) | (None, 1) | 0 |
| text_vectorization (TextVectorization) | (None, 500) | 0 |
| embedding (Embedding) | (None, 500, 3) | 6,000 |
| dropout_2 (Dropout) | (None, 500, 3) | 0 |
| global_average_pooling1d_1 (GlobalAveragePooling1D) | (None, 3) | 0 |
| dropout_3 (Dropout) | (None, 3) | 0 |
| dense_2 (Dense) | (None, 32) | 128 |
| dense_3 (Dense) | (None, 32) | 1,056 |
| fake (Dense) | (None, 2) | 66 |
Total params: 7,250 (28.32 KB)
Trainable params: 7,250 (28.32 KB)
Non-trainable params: 0 (0.00 B)
- We will now display the model in an easy-to-read manner, done below.
In [17]:
# plot the model layers in a flow chart format
utils.plot_model(model_text, "model_text.png",
show_shapes=True,
show_layer_names=True)
Out[17]:
- As we can see, the model has the following layers: input, text vectorization, embedding, dropout, global average pooling 1D, dropout, and three dense layers.
- Now that we have created the model, we will fit the model using the training and validation sets above and 50 epochs.
In [18]:
# fit the model on our training data, testing it on the validation data
history_text = fit_model(model_text, train, val, 50)
Epoch 1/50  180/180 - accuracy: 0.5838 - loss: 0.6519 - val_accuracy: 0.8429 - val_loss: 0.3745
Epoch 2/50  180/180 - accuracy: 0.8719 - loss: 0.3239 - val_accuracy: 0.8991 - val_loss: 0.2387
[...]
Epoch 26/50 180/180 - accuracy: 0.9710 - loss: 0.0833 - val_accuracy: 0.9467 - val_loss: 0.1355
Epoch 27/50 180/180 - accuracy: 0.9701 - loss: 0.0813 - val_accuracy: 0.9433 - val_loss: 0.1454
- The early stopping condition halted the fitting of the model after 27 epochs.
- As we can see, the model stabilized between 0.93 and 0.95 validation accuracy. This is very good, but we may be able to do even better using both the title and text predictors.
- We will now visualize the accuracy of the model on the training and validation data.
In [19]:
# plot training and validation accuracy calculated during model fitting
plot_model_accuracy(history_text, "Text")
- Throughout the model fitting process, the validation accuracy is slightly lower than the training accuracy. Hence, we may have a slight issue with overfitting in this model.
- The validation accuracy is very steady throughout the fitting process, staying between 0.92 and 0.95. This is very promising.
- We will now attempt to create a better model.
Model Using Both Title and Text¶
- Our third and final model will use both the title and the text of an article to predict whether it is fake or real.
- The model is created below and its summary is outputted.
In [16]:
# create model using title and text as predictors and output its summary
model_both = get_model("both")
model_both.summary()
Model: "functional_5"
| Layer (type) | Output Shape | Param # | Connected to |
|---|---|---|---|
| title (InputLayer) | (None, 1) | 0 | - |
| text (InputLayer) | (None, 1) | 0 | - |
| text_vectorization (TextVectorization) | (None, 500) | 0 | title[0][0], text[0][0] |
| embedding (Embedding) | (None, 500, 3) | 6,000 | text_vectorizati…, text_vectorizati… |
| dropout_6 (Dropout) | (None, 500, 3) | 0 | embedding[3][0] |
| dropout_8 (Dropout) | (None, 500, 3) | 0 | embedding[4][0] |
| global_average_poo… (GlobalAveragePool…) | (None, 3) | 0 | dropout_6[0][0] |
| global_average_poo… (GlobalAveragePool…) | (None, 3) | 0 | dropout_8[0][0] |
| dropout_7 (Dropout) | (None, 3) | 0 | global_average_p… |
| dropout_9 (Dropout) | (None, 3) | 0 | global_average_p… |
| dense_5 (Dense) | (None, 32) | 128 | dropout_7[0][0] |
| dense_6 (Dense) | (None, 32) | 128 | dropout_9[0][0] |
| concatenate_1 (Concatenate) | (None, 64) | 0 | dense_5[0][0], dense_6[0][0] |
| dense_7 (Dense) | (None, 32) | 2,080 | concatenate_1[0]… |
| fake (Dense) | (None, 2) | 66 | dense_7[0][0] |
Total params: 8,402 (32.82 KB)
Trainable params: 8,402 (32.82 KB)
Non-trainable params: 0 (0.00 B)
- We will now display the model in an easy-to-read manner, done below.
In [13]:
# plot the model layers in a flow chart format
utils.plot_model(model_both, "model_both.png",
show_shapes=True,
show_layer_names=True)
Out[13]:
- As we can see, the model has two input layers: one for title and one for text. These inputs share a text vectorization layer and an embedding layer, then pass through separate dropout, global average pooling, dropout, and dense layers. The two branches are then concatenated into a single layer, followed by two more dense layers.
- Now that we have created the model, we will fit the model using the training and validation sets above and 50 epochs.
In [17]:
# fit the model on our training data, testing it on the validation data
history_both = fit_model(model_both, train, val, 50)
Epoch 1/50  180/180 - accuracy: 0.7725 - loss: 0.5533 - val_accuracy: 0.9736 - val_loss: 0.1377
Epoch 2/50  180/180 - accuracy: 0.9685 - loss: 0.1189 - val_accuracy: 0.9709 - val_loss: 0.0991
Epoch 3/50  180/180 - accuracy: 0.9762 - loss: 0.0790 - val_accuracy: 0.9729 - val_loss: 0.0940
Epoch 4/50  180/180 - accuracy: 0.9769 - loss: 0.0666 - val_accuracy: 0.9784 - val_loss: 0.0937
Epoch 5/50  180/180 - accuracy: 0.9797 - loss: 0.0594 - val_accuracy: 0.9747 - val_loss: 0.0935
Epoch 6/50  180/180 - accuracy: 0.9786 - loss: 0.0572 - val_accuracy: 0.9764 - val_loss: 0.0923
Epoch 7/50  180/180 - accuracy: 0.9800 - loss: 0.0539 - val_accuracy: 0.9709 - val_loss: 0.1038
Epoch 8/50  180/180 - accuracy: 0.9802 - loss: 0.0522 - val_accuracy: 0.9780 - val_loss: 0.1008
Epoch 9/50  180/180 - accuracy: 0.9751 - loss: 0.0595 - val_accuracy: 0.9773 - val_loss: 0.0941
Epoch 10/50 180/180 - accuracy: 0.9794 - loss: 0.0561 - val_accuracy: 0.9769 - val_loss: 0.0962
Epoch 11/50 180/180 - accuracy: 0.9817 - loss: 0.0538 - val_accuracy: 0.9798 - val_loss: 0.0951
- The early stopping condition halted the fitting of the model after 11 epochs.
- As we can see, the model stabilized between 0.975 and 0.98 validation accuracy. This is our best result yet, so this will be our final model.
- We will now visualize the accuracy of the model on the training and validation data.
In [18]:
# plot training and validation accuracy calculated during model fitting
plot_model_accuracy(history_both, "Both Text and Title")
- As we can see, the validation accuracy is slightly lower than the training accuracy throughout the model fitting. This indicates that we may have a slight issue with overfitting in this model. However, the training and validation accuracy never differ by more than 0.01, so we can neglect this concern.
- The validation accuracy is steady between 0.975 and 0.98 after the first 2 epochs. This is our best model yet, so we will use it as our final model.
4. Model Evaluation¶
- We will now evaluate our best model, the model using both title and text, on some test data.
- This data is loaded in, converted to a dataset, and evaluated on the model below.
In [19]:
# load in the test data as a pandas dataframe
test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"
test = pd.read_csv(test_url)
# convert the test data into a tensorflow dataset
test = make_dataset(test)
# evaluate the model on the test data
model_both.evaluate(test)
225/225 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9781 - loss: 0.0850
Out[19]:
[0.08292855322360992, 0.9786182045936584]
- The model has 0.9786 accuracy on the testing data, so it is very effective in predicting whether articles are fake or real news.
- We are very happy with this high accuracy.
5. Embedding Visualization¶
- We will now visualize the embedding layer of our model.
- We want to find specific words that the model found to be especially useful in discerning fake news from real news.
- The below code gets the weights from the embedding layer, gets the vocabulary from our vectorization layer, and uses PCA to reduce the weight vectors to 2 dimensions.
- We then create a dataframe with the word and its two-dimensional vector expression.
In [20]:
# get the weights from the embedding layer
weights = model_both.get_layer('embedding').get_weights()[0]
# get the vocabulary from our data prep for later
vocab = vectorize_layer.get_vocabulary()
# perform PCA to reduce the dimensions of the word weights to 2
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)
# create a dataframe of the weights of each word in the model
embedding_df = pd.DataFrame({
'word' : vocab,
'x0' : weights[:,0],
'x1' : weights[:,1]
})
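For intuition, PCA here projects each 3-dimensional word vector onto the two directions of greatest variance. A minimal numpy sketch of the same projection on random stand-in weights (not the model's actual embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 3))  # stand-in for the 2000 x 3 embedding matrix

# center the data, then project onto the top-2 eigenvectors of the covariance
centered = weights - weights.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
reduced = centered @ top2
print(reduced.shape)  # (10, 2)
```

sklearn's `PCA` performs an equivalent centering and projection for us, which is why each word ends up with just two coordinates `x0` and `x1`.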
- We will now use this dataframe to display the words in a scatterplot, to reveal any correlations or groupings that our model has determined.
In [21]:
# plot a scatterplot with the weights of each word on the x and y axes
fig = px.scatter(embedding_df,
                 x = "x0",
                 y = "x1",
                 size = list(np.ones(len(embedding_df))),
                 size_max = 5,
                 hover_name = "word")
fig.show()
- Upon viewing the words on the right side of the scatterplot, these appear to be associated with fake news. The words on the left side appear to be associated with real news.
- We will now look at 5 specific words that stick out in the scatterplot above.
- trumps: This word appears on the far right of the scatterplot. This is very interesting because it indicates that the model associates the former US president's name with fake news. This may be due to the fact that there is so much media about the president that a lot of it ends up being fake.
- we: This word appears the second to the furthest right. This makes a lot of sense because the word we is often associated with subjective views which are often found in opinion pieces, which may be portrayed as fake news by our model.
- i: This word appears the third to the furthest right. Very similar to the word we, this word is often associated with subjective views, which may be portrayed as fake news by our model.
- gop: This word appears on the far left of the visualization, which means that our model associates it with real news. GOP normally refers to the US Republican party, so our model predicts that articles about the Republican party are generally real news.
- rep: This word appears second to furthest to the left in the visualization. Similar to the word gop, the word rep is also associated with the US republican party. This reinforces the idea that our model associates articles about the republican party with real news.